Hodge theory-based biomolecular data analysis

Hodge theory reveals the deep intrinsic relations of differential forms and provides a bridge between differential geometry, algebraic topology, and functional analysis. Here we use Hodge Laplacian and Hodge decomposition models to analyze biomolecular structures. Different from traditional graph-based methods, biomolecular structures are represented as simplicial complexes, which can be viewed as a generalization of graph models to their higher-dimensional counterparts. Hodge Laplacian matrices at different dimensions can be generated from the simplicial complex. The spectral information of these matrices can be used to study intrinsic topological information of biomolecular structures. Essentially, the number (or multiplicity) of k-th dimensional zero eigenvalues is equivalent to the k-th Betti number, i.e., the number of k-th dimensional homology groups. The associated eigenvectors indicate the homological generators, i.e., circles or holes within the molecular-based simplicial complex. Furthermore, Hodge decomposition-based HodgeRank model is used to characterize the folding or compactness of the molecular structures, in particular, the topological associated domain (TAD) in high-throughput chromosome conformation capture (Hi-C) data. Mathematically, molecular structures are represented in simplicial complexes with certain edge flows. The HodgeRank-based average/total inconsistency (AI/TI) is used for the quantitative measurements of the folding or compactness of TADs. This is the first quantitative measurement for TAD regions, as far as we know.

www.nature.com/scientificreports/ Topological representations for biomolecules. The biomolecular topology is of essential importance for biomolecular flexility, dynamics and functions. For instance, molecular dynamic (MD) force field involves geometric/topological features, such as bond length, bond angle, dihedral angle, and other graph-based properties 64 . In fact, graphs or networks are the most frequently used models for the representation of molecular structures from materials, chemistry and biology 3,4 . Mathematically, a graph G = (V , E) contains the vertex set V = {v i : 1 ≤ i ≤ N} and the edge set E = {(v i , v j ) : 1 ≤ i, j ≤ N} . Generally speaking, graph representations only characterize the zero and one-dimensional information within the structure. A simplicial complex is a generalization of graphs into their higher dimensional counterpart. The most commonly used simplicial complexes are triangle meshes and tetrahedron meshes. A general simplicial complex is composed of simplexes. Based on simplicial complexes, various algebraic groups, boundary operators, and Hodge Laplacian matrices can be defined. Let d, k be any two positive integers and U = {u 0 , u 1 , . . . , u k } be a collection of points in R d . We say that this collection of points are affinely independent if the set {u i − u 0 } k i=1 is linearly independent. A point x ∈ R d is said to be an affine combination of points in U if it can be written as a linear combination of points in U whose coefficients sum to 1, that is, for some i ∈ R and k i=0 i = 1. If i ≥ 0 also holds, then x is said to be a convex combination of points in U. The convex hull of U is the set of all convex combinations of points in U. The fundamental building blocks of simplical complex are simplexes.
Let U = {u 0 , u 1 , . . . , u k } be an affinely independent set of k + 1 points in R d . A k-simplex σ k is the convex hull of U, denoted by [u 0 , u 1 , . . . , u k ] . The dimension of σ k is k. Geometrically, a 0-simplex is simply a point, an 1-simplex is called an edge, a 2-simplex is called a triangle and a 3-simplex is called a tetrahedron. A face τ of a k-simplex σ k is the convex hull of a non-empty subset A of U, denoted by τ ⊂ σ k . An oriented k-simplex is a k-simplex with an orientation, i.e., a sequence arrangement of its vertices. If two k-simplexes σ k 1 and σ k 2 are of the same orientation, they are denoted as σ k 1 ∼ σ k 2 . Two simplexes σ k 1 and σ k 2 are upper adjacent and denoted as σ k 1 ⌢ σ k 2 , if they are faces of a common (k + 1)-simplex, and they are lower adjacent and denoted as σ k 1 ⌣ σ k 2 , if they share a common (k − 1)-simplex as their face. For these two oriented k-simplices, if the orientations of their common lower simplex are the same, they are called a similar common lower simplex and denoted by σ k 1 ⌣ σ k 2 and σ k 1 ∼ σ k 1 . Otherwise, it is called a dissimilar common lower simplex and denoted by σ k 1 ⌣ σ k 2 and σ k 1 ∼ σ k 2 . The (upper) degree of a k-simplex σ k , denoted by d(σ k ) , is the number of (k + 1)-simplices of which σ k is a face.
A simplicial complex K is a finite collection of simplices that satisfy two conditions. Firstly, any face of a simplex in K is also in K. Secondly, the intersection of any two simplices in K is either empty or a face of both. A simplicial k-complex is a simplicial complex where the largest dimension of simplices in K is k. Figure 1 illustrates the comparison between a graph and a simplicial complex for a protein (ID:2OFS). For the graph model, vertices represent molecular atoms and the edges are for covalent-bonds. The simplicial complex is constructed using Vietoris-Rips complex. Essentially, a cutoff-distance of 4.0Å is used and a k-simplex is formed among k + 1 vertices whose pair-wise distances are all smaller than 4.0Å. A simplicial complex can be viewed as a generalization of graph into its higher dimensional counterpart. Vertices (0-simplexes) and edges (1-simplexes) from graphs can be extended to higher dimensional elements, including triangles (2-simplexes) and tetrahedrons (3-simplexes

Hodge Laplacian and Hodge decomposition
Hodge Laplacian matrices of different dimensions can be constructed on a simplicial complex. A k-th dimensional HL matrix characterizes topological connections between k-th simplexes. Note that the graph Laplacian, which is 0-th dimensional HL, characterizes relations between vertexes (0-simplexes).

Hodge Laplacian model. Mathematical background for Hodge Laplacian.
The kth chain group C k (K) of a simplicial complex K over some field F is a vector space over the F whose basis is the set of k-simplices of the simplicial complex K. Elements of C k (K) are called k-chains. The dual of C k (K), denoted by C k (K), is the set of all linear functionals on C k (K): is called the k-th cochain group and its elements are called k-cochains. Boundary operators are defined on both the chain and cochain groups. The boundary map ∂ k : C k (K) → C k−1 (K) is a linear transformation which acts on a k-simplex σ k = [u 0 , u 1 , . . . , u k ] as follows The coboundary map δ k : C k (K) → C k+1 (K) is a linear transformation defined as follows: for a linear functional φ ∈ C k (K) and a k + 1-simplex The boundary map gives rise to a chain complex, which is a sequence of chain groups connected by boundary maps as follows: Similar to the boundary map giving rise to the chain complex, the coboundary operator gives rise to a cochain complex: Since C k (K) and C k (K) are finite-dimensional, there exists unique matrix representations for ∂ k and δ k . We have some useful relations regarding matrix representations of ∂ k and δ k ( A T represents the transpose of a matrix A): Here, δ * k : C k+1 (K) → C k (K) is the adjoint/transpose map of δ k where for every f ∈ C k (K) , g ∈ C k+1 (K) and a suitable inner product , for C k (K) and C k+1 (K). The adjoint of the boundary operator ∂ k , ∂ * k is also defined analogously. These relations above allow us to work unilaterally from the boundary operator's perspective, which is the easiest to compute amongst the two.
The k-dimensional combinatorial Laplacian is the linear operator � k : C k (K) → C k (K) is defined as follows: The case where k = 0 gives rise to the expression of the well-known graph Laplacian.

Discrete Hodge Laplacian.
The boundary operator ∂ k has a unique matrix representation. Given a simplicial complex K, the k-th boundary matrix B B B k is defined as, Here σ k−1 i is the i-th (k − 1)-simplex and σ k j is the j-th k-simplex. Given that the highest order of the simplicial complex K is n, the kth Hodge Laplacian (or combinatorial Laplacian) matrix L L L k of K is www.nature.com/scientificreports/ These k-th HL matrices can also be expressed in terms of simplex relations. When k = 0, The HL matrix L L L 0 is the graph Laplacian matrix. When k > 0, Mathematically, the eigenvalues of HL matrices are independent of the choice of the orientation 18 .
Hodge decomposition model. Hodge  Geometrically, the cochain C 1 (K) can be viewed as edge flows as it consists of all scalar functions on the 1-simplices (edges). The term ker(δ 1 ) can be regarded as gradient flows, term ker(� 1 ) can be regarded as harmonic flows, and term Im(δ * 1 ) can be regarded as curl flows.

Discrete Hodge decomposition and HodgeRank. Computationally, Hodge decomposition-based
surface vector field analysis 14 have received a lot of attention, and found various applications in geometric processing, computer graphs, and fluid dynamics analysis. Different from these 2D surface or 3D domain-based vector decomposition models, a simplicial complex-based Hodge decomposition model, known as HodgeRank, has been proposed for statistical ranking 25 . In HodgeRank 25 , an edge flow value Y on an edge is regarded as a ranking order, that is if the flow goes from vertex i 1 to vertex i 2 , then the score is higher at i 1 than i 2 (as flow goes from higher "place" to lower "place"). The edge flow from vertex i 1 to vertex i 2 is denoted as Y [i 1 ,i 2 ] . If the rank value for i 1 is a scale with value f i 1 and for i 2 In this way, gradient flows Y g are globally consistent in terms of ranking, as they always go from higher values to low values 25 . In contrast, harmonic flows Y h and curl flows Y c are inconsistent in ranking models 25 . In both terms, the flows can travel from one vertex to some other vertices and then return to the same exact vertex. This is problematic for ranking, as it means a "large" value can keep on decreasing to "small" values, but still return to the same value. The harmonic flows are globally inconsistent and curl flows are locally inconsistent.
Given the edge flow values Y, HodgeRank gives the gradient flow term Y g , the curl flow term Y c , and the harmonic flow term Y h . The detailed algorithm for Hodge decomposition is listed in Algorithm 1.  here N is the total number of 0-simplexes (vertices) in the simplicial complex. We also note that the TI/AI indices does not depend on the ordering of vertices, as the number of edges and triangles in a Vietoris-Rips simplicial complex with a fixed threshold distance γ is invariant under the renumbering of data points, and the set of values in the vectors of curl and harmonic flows are each uniquely determined by γ.
Hodge-theory-based biomolecular structure analysis. We use Hodge Laplacian and Hodge decomposition models to analyze biomolecular structures. Both homological and non-homological eigenvectors from Hodge Laplacian can be used in the different types of spectral clustering. The Hodge decomposition-based Hodgerank model can be used in the systematic characterization of biomolecular folding and compactness, in particular, the analysis of TAD regions from Hi-C data.
Hodge Laplacian-based biomolecular structure analysis. The multiplicity of the zero eigenvalue, i.e., the total number of zero eigenvalues, of L k is the k-th Betti number β k . Geometrically, β 0 is the number of connected components, β 1 is the number of circles or loops, and β 2 is the number of cavities. Moreover, the zero eigenvalue related eigenvectors are related to homology generators. They can be used to identify the associated topological features, such as circles, loops, and voids in the structures. The eigenvectors from nonzero eigenvalues are related to clusters and communities within the data, and can be used for spectral clustering. Figure 2 illustrates L 1 -based eigenvectors for Guanine structures. The absolute values of L 1 eigenvectors are plotted on the edges and represented by colors.Two homology generators, i.e., eigenvectors from the zeroeigenvalue, are considered. It can be seen that for each homology generator, their largest absolute values are all concentrated around a loop structure, which is either a pentagon ring or a hexagon ring. In contrast, for the two (non-homological) eigenvectors that are from the non-zero-eigenvalues, their values can be used to identify domains or clusters.
HodgeRank-based biomolecular structure analysis. Recently, Hodge decomposition for vector fields over 3D bounded domains has been systematically explored and been applied to biomolecular dynamic analysis 33,34 .
The essential idea is to use HodgeRank-based TI/AI indices as a way to measure the folding, curvedness and compactness of the biomolecular structures. In our model, edge flows represent distance relations between biomolecular atoms. Note that biomolecular atoms have a unique ordering or sequence. For instance, DNAs are www.nature.com/scientificreports/ double helix structures from the gene sequence, and proteins are from peptide sequences. The gene or peptide sequence provides a natural ordering of the atoms in a biomolecule. In this way, even though the biomolecules have highly complicated 3D structures, their atoms, in particular the backbone atoms, can be systematically arranged into a unique sequence (following their gene sequence). Furthermore, the inconsistence from edge flows can be used to model Euclidian distances deviated from the straight lines. For two vertices i 1 and i 2 with coordinate r i 1 and r i 2 , the edge flow Y [i 1 ,i 2 ] is defined as, Note that edge flows are always positive if they follow the chain sequence. More specifically, if vertex i 2 comes later than i 1 along the chain sequence, then Y [i 1 ,i 2 ] is always positive, otherwise the edge flow is negative. Motivated by the triangle inequality definition, we propose to use local inconsistence to measure the curvedness of the biomolecular chains. More specifically, if three vertices i 1 , i 2 and i 3 are located in a straight line, we should always have the sum meaning there is no curvedness or folding. In contrast, if the sum is nonzero, there will be a deviation from the straight line. More generally, if the whole chain is a straight line, the edge flows defined above will only have gradient terms. Both harmonic flows and curl flows will be zero. In contrast, if a chain is folded, the harmonic flows and curl flows are nonzero and can be used to characterize the curvedness, folding and compactness of structures. In Fig. 3, we illustrate different flow terms of the simplicial complexes generated from an partially-folded protein structure (details in "Protein folding analysis"). It can be seen that the large-valued curl flow terms are all concentrated in the highly-packed or folded regions. The harmonic flow terms are all zero as there is no 1D harmonic circles in the simplicial complexes. It is worth mentioning that the curl flow terms are only defined on 2-simplexes (triangles), thus there will be no curl flow terms if there is no 2-simplexes.

Results
In this section, we apply Hodge Laplacian and HodgeRank models into biomolecular data analysis and Hi-C data analysis. HL-based eigenvectors are used to reveal cycle or loop structures within molecules. Furthermore, Hodge decomposition-based TI/AI indices are used for protein, DNA and chromatin folding analysis.
Hodge-theory-based biomolecular data analysis. HL-based biomolecular structure analysis. The representation and characterization of biomolecular structures are of great importance for analyzing biomolecular functions. Among the various structural properties are biomolecular topological features, including rings, channels, cages, voids, etc. For instance, the closing and opening of ion channels are highly related to the channel structures. The virus capsids are cage-like structures with high symmetries. All these topological information can be well characterized by homology generators. Mathematically, eigenvectors of the zero eigenvalues of the kth HL matrices are kth homology generators. Only C α atoms are considered and the protein configuration is taken from the SMD simulation 65 . The simplicial complex is generated with a cutoff at 11 Å. It can be seen that most of curl terms with larger values are concentrated near highly-packed regions. All harmonic terms are zero, since there is no harmonic flows (no 1D harmonic circles in the simplicial complex). Protein folding analysis. Here we use HodgeRank-based inconsistence to quantitatively measure the folding of protein configurations. We consider the Titin molecule. The trajectory data is obtained from the Steered molecular dynamics (SMD) simulation 65 . SMD simulations are designed to study the protein folding mechanism through an inverse unfolding process 65 . Essentially, a constant force or velocity is applied to one end of protein (with the other end fixed) to unfolded into a straight chain. In this way, various metastates can be observed from the dynamic process. We take 97 configurations equally from the simulation trajectory and renumber the sequence so that the last configuration (which is the straight chain) comes first and the first configuration (initial well-folded structure) comes last. Eight of these configurations are plotted in Fig. 5. Only C α atoms are considered. It can be seen that, after renumbering, a protein folding process from a straight peptide chain to a well-folded 3D structure is observed. From these protein configurations, we can construct a series of simplicial complexes using the Vietoris-Rips complex. Figure 5 illustrates eight simplicial complexes generated from eight different Titin configurations. A cutoff distance of 11 Å is used to generate the Vietoris-Rips complex.
TI is used to measure the folding of protein structures. Since the Titin molecule has only a single chain, we take all the C α atoms and rank them according to their amino acid sequence numbers. In this way, for two atoms i 1 and i 2 ( i 1 < i 2 ) with coordinate r i 1 and r i 2 , if there exists an edge between them in a simplicial complex, their edge flow Y [i 1 ,i 2 ] = |r i 1 − r i 2 | according to Eq. (3). Furthermore, we can use the HodgeRank model and calculate TI for each configuration. Other than the cutoff distance of 11Å, we also consider other cutoff distances from 8Å to 14Å. Figure 6 illustrates the TIs of the 97 Titin configurations during the SMD simulation. As mentioned above, the renumbering is considered so the very first configuration corresponds to the straight line structure at the very end of SMD simulation. It can be seen that when Titin folds from a peptide chain to its 3D structure, the TIs increase monotonically with only small fluctuations. When Titin is a long unfolded peptide chain, TI is 0 as there is no curvedness or folding in the structure. The largest TI is obtained when Titin is well folded into its 3D structure. Moreover, with the enlargement of cutoff distance, the corresponding TI increases. This is due to the increasing size of associated simplicial complexes as cutoff distance increases. A larger cutoff distance ensures that the relations between atoms that are far away from each other are still well considered. More importantly, it can be seen clearly that as the increase of TI value, the fluctuations become smaller and smaller, and the TI curve becomes smoother. Even though a larger cutoff distance is preferred, the computational cost increases dramatically. Therefore, in our calculations, we do not consider a fully-connected simplicial complex, i.e., any k + 1 atoms forming a k-simplex, instead, a median-sized cutoff distance is used. However, our TI still provides a suitable quantitative measurement for protein folding. Note that all these protein configurations have the same amount of atoms, therefore their corresponding AIs are of the same pattern as TIs. It is worth mentioning that the Hodg-eRank is based simplicial complex representation, if there is no 2-simplexes, all the curl flow terms will be zero.
DNA and chromatin folding analysis. We consider the folding of DNA at the atomic level. Three different DNA structures, including DNA helix, nucleosome and tetranucleosome (part of chromatin), are used in our analysis. Topologically, DNA helix is the preliminary structure, and can be folded into nucleosome and further into tetranucleosome. We consider only the phosphorus atoms in the three molecules and construct their simplicial complexes using Vietoris-Rips complex. The total numbers of phosphorus atoms for DNA helix, nucleosome and tetranucleosome, are 22, 291 and 692, respectively. Six different cutoff distances are used, including 10 Å, 12 Å, 14 Å, 16 Å, 18 Å and 20 Å. The DNA structures and the associated simplicial complexes are illustrated in  Table 1. Since the simplicial complex for the DNA helix structure at 10 Å has no 2-simplexes, there is no curl flow terms, i.e., all the Y c terms in Eqs. 1 and 2 are zero. Similarly, the Y h terms are also zero. In this way, the TI and AI for the DNA helix structure are all very close to 0. Similarly, at a cutoff of 12Å, both TI and AI are very close to 0 as no 2-simplexes are generated in DNA helix. Due to the folding of DNA chains, the TIs and AIs for both nucleosome and tetranucleosome structures are nonzero. Moreover, TIs for tetranucleosome are consistently larger that those of nucleosome. However, AIs for tetranucleosome are smaller that those of nucleosome at 10 Å. This is due to the reason that our AI is the average TI over the total number of atoms. From Fig. 7, it can be seen that the proportion of 2-simplexes over the total atom number for tetranucleosome is smaller than that of nucleosome, due to the missing 2-simplexes in the center linkage region. With the increase of cutoff distance, well-connected simplicial complexes are generated. The monotonic increase of TIs and AIs from DNA helix to nucleosome, and to tetranucleosome, can be observed clearly. There are highly consistent with the DNA folding patterns, indicating that both our TI and AI models are suitable for the description of curvedness, folding and compactness of biomolecular structures at molecular level.
Hodge decomposition-based Hi-C data analysis. Chromosomes have complicated hierarchial structures. Based on the analysis of Hi-C structures, it is believed that there are two possible types of structures (domains, subregions, etc), i.e., topologically associating domains (TADs) and genomic compartments. Computationally, TAD is defined to be the square region along the diagonal Hi-C maps with large contact values and a size of about 200 kilobases (Kb) to 2 megabases (Mb). Biologically, larger contact values mean these chromosomal loci (specific fixed positions on a chromosome) are close to each other, i.e., they are within a certain highly compacted/folded region. Figure 8 illustrates TAD regions in a Hi-C data. Geometrically, each TAD region (cartoon representation, not realistic experimental results) is believed to be a highly-packed region. The black dash lines mark the boundaries of TADs. However, the TAD is not mathematical rigorously defined, as it is not always easy to clearly identify the so-called "square regions". For instance, its is also reasonable to believe that the two TADs in the middle region can be aggregated into one TAD. Due to the highly complicated hierarchial structure of chromosome, various algorithms have been developed to provide an approximation or estimation of TADs 52,60-63 , but a rigorous definition of TAD remains to be a problem.
In this subsection, we use HodgeRank-based AI index in the quantitative measurement topological associating domains calculated from Hi-C. Computationally, the entry M i,j of a Hi-C matrix M represents the contact frequencies between the i-th and j-th loci of the genome. The higher the contact frequencies between the ith and jth loci of the genome, the higher the probability that these two loci are closer to each other. Computationally, the distance d(i, j) between two loci i and j can be modeled as the α-reciprocal of contact frequency, Figure 6. The illustration of total inconsistency (TI) for Titin configurations during the SMD simulations. A renumbering is considered so the first configuration is the last one in the SMD simulation. It can be seen that TIs are monotonically increasing when protein folds from a straight line to its 3D structures. Here α is the power term and is usually chosen from the range of (0, 1). In our model, we consider the α value to be α = 0.25 and two cutoff distance γ values, i.e., γ = 0.4 and γ = 0.5 , to construct Hi-C matrix-based simplicial complexes. More specifically, if d(i, j) is smaller than the cutoff distance, an 1-simplex (edge) is formed between vertices i and j. Similarly, a 2-simplex (edge) is formed among three vertices i, j, and k if d(i, j) < γ , d(i, k) < γ , and d(j, k) < γ . Since d(i, j) is not the direct experimental measurement for the distance between two loci, we consider two different types of the edge flows. The first type of edge flow is defined based on distance d(i, j) as follows,  We call these two models as distance-based edge flow model and constant edge flow model respectively.
To test the performance of our two HodgeRank models for TAD analysis, we consider TAD regions obtained from human ES (embryonic stem) cells chromosome 10, using directionality index segmented by a Hidden Markov Model (HMM) 52 . The data has a resolution of 40,000 bp (base pairs) or 40 kb, i.e. each locus has a size of 40,000 bp. Six TAD regions are selected and depicted in Fig. 9. The values of contact frequency are represented by colors. A bright yellow color indicates a higher contact frequency, thus a short distance (between the two loci). We systematically evaluate the AIs for all six TAD regions using two edge flow models under two different cutoff distances as stated above. The results are listed in Table 2. To avoid confusion, TAD (a) to TAD ( f) are TAD regions as illustrated in Fig. 9, respectively. It can be seen that even though we use two different edge flow models, the pattern for AIs are highly consistent. That is the AI value monotonically decreases from TAD (a) to TAD (f), except for TAD( a) and TAD( b) at γ = 0.5 . In fact, TAD ( b) has a larger AI value than TAD ( a) in both edge flow models. Mathematically, there is no rigorous model to quantitatively measure the folding of TAD regions. However, if there are more larger contact frequency values, the loci are closer to each other (note that two adjacent loci has same distance), thus the TAD is more compact or folded. It can be observed that our AI values are highly consistent with the TAD patterns as seen in Fig. 9. Further, we consider six different non-TADs. These non-TAD regions are obtained from diagonal regions with lower contact values. Figure 10 illustrates these non-TAD regions and their AI values are listed in Table 3. It can be seen that lower AI values are systematically found for non-TADs than those for TADs.

Conclusion
Hodge theory characterizes the deep intrinsic relations of differential forms and provides a bridge between various areas in mathematics, including differential geometry, algebraic topology, and functional analysis. Here we considered both the Hodge Laplacian model and Hodge decomposition-based HodgeRank model for biomolecular data analysis. The HL-based spectral information, in particular, eigenvectors, are used for protein and DNA structure characterization. More specifically, homology generators are used for cycle and loop structure  Note that it is not always easy to clearly identify these "square regions". For instance, it is also reasonable to believe that the two TADs in the middle region can be aggregated into one TAD. www.nature.com/scientificreports/ characterization. Non-homology related eigenvectors are used in clustering and community detection. Furthermore, we used the total and average inconsistency index from HodgeRank model to characterize the folding, compactness or curvedness of biomolecular structures and topological associated domains in Hi-C data. It has been found that our model can be used to quantitatively measure the folding within TADs. In the future, we will further explore the application of our HL-based clustering/classification, in particular, the homology-based and higher-order-simplex-based clustering/classfication. Moreover, we will study the relation of our AI values with genomic features such as histone modifications, coordinated gene expression, lamina, and DNA replication timing.

Data availability
All the data and codes in the paper are available at https:// github. com/ Expec tozJJ/ Hodge-Theory.  Table 2, the AI values for TADs (a-f) are consistently decreasing, which is highly consistent with TAD patterns. We have also demonstrated six non-TADs in Fig. 10 and Table 3. It can be seen from the comparison that TADs tend to have more largercontact values and their AI values are systematically larger than those from non-TADs. Table 2. HodgeRank-based TAD analysis. The AI values are used for the quantitative measurement of the folding within TAD. A larger AI value indicates more loops and high compactness of TAD region. In contrast, a lower AI value means less folding and less loops within TAD. Two different edge flow models, i.e., distance-based edge flow as in Eq. (4) and constant edge flow as in Eq. (4), are considered. Two different cutoff distances, i.e., γ = 0.4 and γ = 0.5 , are used to construct Hi-C matrix-based simplicial complexes. Here TAD(a) to TAD(f) are TAD regions as illustrated in Fig. 9(a-f), respectively. For both models with two cutoff distance, the AI values decrease monotonically from TAD(a) to TAD(f), except a small inconsistency for at TAD( a) and TAD( b) at γ = 0.5.   The contact frequency values are represented by colors. Bright yellow color indicate higher contact frequency, thus a short Euclidean distance between the two loci. As listed in Table 3, the AI values for non-TAD regions (a-f) are dramatically low as compared to the AI values for TADs. Table 3. HodgeRank-based analysis on regions that are non-TADs. By non-TADs, we refer to the regions where contact frequencies are less or the regions where TADs are mostly not part of the region. The AI values are used for the quantitative measurement of the folding within non-TAD regions. Generally, low AI values were recorded due to the region having less contact frequency as compared to the AI values of TADs. Two different edge flow models, i.e., distance-based edge flow and constant edge flow, are considered. Two different cutoff distances, i.e., γ = 0.4 and γ = 0.5 , are used to construct Hi-C matrix-based simplicial complexes. Here non-TAD(a) to non-TAD(f) are non-TAD regions as illustrated in Fig. 10a-f, respectively.